Abstract

Register integration (or just integration) is a register 
renaming discipline that implements instruction reuse via 
physical register sharing. Initially developed to perform 
squash reuse, the integration mechanism can exploit 
more reuse scenarios. Here, we describe three extensions 
to the original design that expand its applicability and 
boost its performance impact. First, we extend squash 
reuse to general reuse. Whereas squash reuse maintains 
the concept of an instruction instance "owning" its out- 
put register, we allow multiple instructions to simulta- 
neously share a single register. Next, we replace the PC- 
indexing scheme with an opcode-based indexing scheme 
that exposes more integration opportunities. Finally, we 
introduce an extension called reverse integration in 
which we speculatively create integration entries for the 
inverses of operations: for instance, when renaming an
add, we create an entry for the inverse subtract. Reverse 
integration allows us to reuse operations that the pro- 
gram itself has not executed yet. We use reverse integra- 
tion to implement speculative memory bypassing for 
stack-pointer based loads (register fills and restores). 

Our evaluation shows that these extensions increase 
the integration rate — the number of retired instructions 
that integrate older results and bypass the execution 
engine — to an average of 15% on the SPEC2000 integer 
benchmarks. On a 4-way superscalar processor with an 
aggressive memory system, this translates into an aver- 
age IPC improvement of 7%. The fact that integrating 
instructions completely bypass the execution engine 
raises the possibility of using integration as a low-com- 
plexity substitute for execution bandwidth and issue buff- 
ering. Our experiments show that such a trade-off is 
possible, enabling a range of IPC/complexity designs.

1 Introduction 

Register integration (or just integration) is a modifica- 
tion to register renaming that implements instruction 
reuse via physical register sharing [11]. Like other reuse 
schemes, integration enhances performance by cutting 
observed latencies, collapsing reused dependence chains, 
reducing contention for execution bandwidth and issue 
buffers, and accelerating branch resolution. Integration 
does have one unique feature among reuse schemes: it 
accomplishes all of this without reading or writing the 
registers themselves (for the rest of this paper we will use 
the word register to mean physical register). 

Integration was initially designed to capture two reuse 
scenarios: squash reuse [11, 13] and pre-execution reuse
[12]. These forms of reuse exploit certain invariants to
enable a simple and un-obtrusive integration implementa- 
tion. In this paper, we present three extensions to the 
basic implementation that broaden integration's applica- 
bility and increase its performance impact while main- 
taining simplicity and modularity. First, we extend 
squash reuse to general reuse by allowing multiple 
instruction instances to share the same register simulta- 
neously. We accomplish this using a register reference 
counting scheme. General reuse enables the integration 
of registers which are the outputs of instructions which 
have been squashed, are in-flight, have retired, or have 
retired and been architecturally overwritten. This exten- 
sion increases the integration rate — the number of retire- 
ment stream instructions that benefit from integration — 
from 2% to 9%. Next, we present an opcode-based index- 
ing scheme that exposes more integration opportunities 
while minimizing integration table conflicts. Opcode 
indexing increases the integration rate to approximately 
12%. Our final, and most significant, proposed extension 
is reverse integration. In reverse integration, the renam- 
ing of an operation triggers the creation of an integration 
entry for the inverse operation: an add creates an entry for 
the complementary subtract, a store creates an entry for 
the complementary load, and so on. Reverse integration 
can achieve dataflow graph compression beyond that 
which is possible via direct (i.e., conventional, repetition- 
based) reuse. In this paper, we use reverse integration to 
implement speculative memory bypassing [9] for stack 
loads — register fills and restores — essentially for free. 
With the addition of reverse integration, the number of 
instructions that benefit from integration rises to 15%. 
We evaluate these extensions using cycle-level simulation
of the SPEC2000 integer benchmarks. On a 4-way super- 
scalar processor with an aggressive memory system, we 
observe average speedups of 7%, with 13% gains on sev- 
eral benchmarks. 

Integrating instructions bypass the out-of-order execution
engine, raising the possibility of using integration
as a substitute for execution core resources. The trade-off 
of integration complexity for execution complexity is 
potentially a good one. Integration has been shown to be 
amenable to pipelining and insensitive to pipeline latency 
[13]. We show that, in terms of IPC, it can substitute for 
both execution width and scheduling window size. 

The next two sections recap basic register integration 
and present our extensions, respectively. Section 4 con- 
tains both limit studies and evaluations of realistic inte- 
gration configurations. Section 5 discusses related work. 
Our conclusions are presented in Section 6.

2 Register Integration Primer

Register integration was initially developed to imple- 
ment squash reuse [11] and pre-execution reuse [12]. 
Since we are working in a superscalar context, we present 
our extensions assuming a base squash-reuse mechanism; 
specifically, its more refined second implementation [13]. 
We briefly recap that mechanism here. 

The integration operation. Register integration is an 
extension to pointer-based register renaming, the style 
used in MIPS R10000 [17], Alpha 21264 [6], and Pentium 
4 [4]. Integration allows multiple dynamic instruction 
instances to use the same register instance for their shared 
result. Reuse (sharing) is accomplished by pointer manip- 
ulation: the reusing instruction sets its output logical regis- 
ter to point to the register containing the original value. 
Integration identifies reuse opportunities by performing an 
operational equivalence test on each instruction as it is 
renamed. An instruction may reuse the result of a previous 
instruction if it performs the same operation (heretofore 
represented by PC) on the same registers. To facilitate 
such a comparison, an integration table (IT) stores
<operation, input_preg1, input_preg2, output_preg> tuples of recent
instructions. One unique feature of integration is that neither
the reuse operation nor the reuse test requires any register
values to be read or written.
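
The operational equivalence test can be pictured as a table lookup keyed on the operation and the input registers. The following is a minimal Python sketch under stated assumptions, not the hardware design: the names IntegrationTable and rename, the dictionary storage, and the list-based free list are hypothetical, and the register state check (only squashed registers are eligible in squash reuse) is omitted for brevity.

```python
from typing import Dict, List, Optional, Tuple

# Key: (operation, input_preg1, input_preg2); value: output_preg.
# The operation is represented by the PC, as in PC-indexed squash reuse.
ITKey = Tuple[int, Optional[int], Optional[int]]


class IntegrationTable:
    """Hypothetical, fully associative stand-in for the IT."""

    def __init__(self) -> None:
        self.entries: Dict[ITKey, int] = {}

    def lookup(self, op: int, in1: Optional[int], in2: Optional[int]) -> Optional[int]:
        """Return the output register of a matching older instruction, if any."""
        return self.entries.get((op, in1, in2))

    def insert(self, op: int, in1: Optional[int], in2: Optional[int], out: int) -> None:
        self.entries[(op, in1, in2)] = out


def rename(it: IntegrationTable, map_table: Dict[str, int], free_list: List[int],
           pc: int, dst: str, src1: str, src2: Optional[str] = None) -> Tuple[int, bool]:
    """Rename one instruction; return (output physical register, integrated?)."""
    in1 = map_table[src1]
    in2 = map_table[src2] if src2 is not None else None
    hit = it.lookup(pc, in1, in2)
    if hit is not None:
        # Reuse is pure pointer manipulation: no register value is read or written.
        map_table[dst] = hit
        return hit, True
    # Failed integration: allocate a register and record a new IT entry.
    out = free_list.pop(0)
    it.insert(pc, in1, in2, out)
    map_table[dst] = out
    return out, False
```

Renaming the same PC twice with the same input mapping allocates a register the first time and reuses it the second time; neither step touches a register value.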

Major components and organization. Figure 1 shows 
the main components of integration and their logical 
placement in the pipeline. The integration components are 
the integration logic (a modification to renaming), the 
integration table (IT), the register state vector, the load 
integration suppression predictor (LISP) and the DIVA 
verifier [1]. We have already introduced the integration 
table. The register state vector maps each register to one of 
three states: free, active, or squashed. The vector indicates 
which registers are integration-eligible (only squashed 
registers may be integrated) and also acts as the free list. 

Integration is a multi-step process. A group of instruc- 
tions reads the IT to generate a group of IT entries. The IT 
entries are internally cross-checked to determine the possibility
of integrating dependence chains. During register
renaming itself, the map table and register state vector are 
read. The information from the IT, map table, and state 
vector is combined by the integration logic to make inte- 
gration decisions. These are reflected by changes to the 
map table and state vector and the creation of new IT 
entries (for failed integrations). Of these, only the integration
logic forms a critical loop with register renaming.

FIGURE 1. Register integration implementation

Integrating instructions bypass the out-of-order execu- 
tion engine completely and are not allocated reservation 
stations. System calls, stores (whose execution enables 
load forwarding), and direct jumps (whose decode-time 
execution is essentially free) are not integrated. 

Mis-integrations. Mis-integrations — the integrations of 
incorrect results — are a rare but inevitable byproduct of 
integration. There are two kinds of mis-integrations. Load 
mis-integrations occur because the integration test is 
purely register based. In a load mis-integration, a load 
integrates despite the presence (or absence) of a conflict- 
ing store that did not (or did) exist for the original load. 
Register mis-integrations occur when new mappings coincidentally
match stale IT entries.

Mis-integrating instructions must not be retired. This
directive may be implemented conservatively, by avoiding 
potential mis-integrations a priori, or aggressively, by 
detecting mis-integrations and recovering from them. 
Used by itself, neither approach is satisfactory. The con- 
servative solution requires IT invalidations and drastically 
depresses integration rates. The aggressive solution 
requires integrating instructions to be re-executed and pro- 
duces many expensive mis-integrations. 

These circumstances motivate a combined approach 
[13]. First, we detect all mis-integrations by re-executing 
integrating instructions in-order prior to retirement. This 
form of re-execution is both cheaper and less perfor- 
mance-critical than execution by the out-of-order core. 
DIVA [1] re-executes all instructions in this way to toler- 
ate many kinds of faults, including design errors. If DIVA 
is present, it provides us with free re-execution and mis- 
integration detection. Then, with re-execution ensuring 
correctness, we use simple mechanisms to suppress most 
mis-integrations while keeping integration rates high. 
Load mis-integrations are functions of store-load depen- 
dences, and thus are highly predictable. We suppress them 
using a load integration suppression predictor (LISP). The 
LISP learns from past mis-integrations to suppress the 
future integration of offending loads. We suppress register 
mis-integrations using a simple generation counting 
scheme which we describe in Section 3.1. 

3 Three Extensions 

The extensions we propose involve only minimal and 
localized modifications to the "existing" (i.e., squash 
reuse) integration machinery. General reuse requires 
changes to the register state vector and IT. Opcode index- 
ing and reverse integration require IT changes only. 

3.1 General Reuse via Multiple Integration 

Squash reuse is the reuse of results that are created 
during the course of (mis-)speculation; it is a function of 
speculation in the microarchitecture and the control-recon- 
vergent nature of the program. General reuse is the reuse 
of results generated by older architectural instructions and 
is a function of the dynamic redundancy built into programs
by compilers and programmers. In PC-based general
reuse, instructions reuse the results generated by older
instances of themselves. Loop-invariant instructions that 
were not hoisted by the compiler and program-constant 
based instructions (e.g., loop initialization and control) in 
successive invocations of the same function are common 
fodder for PC-based general reuse. 

The primary implementation change from squash to 
general reuse is the introduction of simultaneous register 
sharing. In squash reuse, multiple dynamic instructions 
share a single register output, but not simultaneously. An 
integrating instruction (i.e., its output logical register) 
assumes ownership of the integrated register. There is no 
need to track how many times a register is mapped — that 
number is always one. Mapping (logical register) transi- 
tions unilaterally trigger register transitions (e.g., the free- 
ing of a mapping triggers the freeing of a register) without 
checking the state vector. In general reuse, a register may 
be simultaneously mapped by multiple logical registers, 
some of which may be the outputs of in-flight instructions. 
General reuse precludes the notion of register ownership 
and the resulting simplifications (e.g., a register can be 
reclaimed only when the last mapping to it is freed). 

To facilitate simultaneous register sharing, we general- 
ize the contents of the register state vector to reference 
counts. Each register's entry is the number of active map- 
pings to that register. An active mapping is either in-flight 
or retired, but not shadowed/overwritten. In other words, it 
can be read by any new instruction. Mapping operations — 
allocations or integrations — increment the count. Unmap- 
ping operations — squashes or overwrites — decrement it. A 
register is free when its reference count is zero. Note, the 
retirement of an instruction does not change the reference 
count of its output register. 

Our scheme requires that we distinguish between two 
different zero-reference states. One corresponds to the 
squash reuse free state and is interpreted as "the register 
contains a garbage value." The other corresponds to the 
squash reuse squashed state and is interpreted as "this register
is currently unused but does contain a useful value and
is integration-eligible." Ordinarily, the second state alone 
would suffice. We could allow registers that contain gar- 
bage to be integrated and detect the resulting mis-integra- 
tions. However, the presence of squash reuse necessitates 
the first state. On a mis-speculation, we flush squashed 
instructions that have not executed from the reservation 
stations. Now, integrating instructions are not allocated 
reservation stations under the assumption that either 1) the 
result is ready, or 2) an older in-flight instruction will write 
this register. If we allow registers from squashed un-exe- 
cuted instructions to be integrated, the corresponding 
operation will never execute, the integrating instruction 
will never complete, and the processor will deadlock 
before the offending instruction enters the re-execution 
stage. While we can detect (and recover from) this condi- 
tion, this scenario arises too frequently for such a low-per- 
formance solution. To represent two zero-reference states, 
we augment the reference count with a valid bit. The bit is 
set for all integration-eligible registers, i.e., all except for 
unmapped registers of the first kind. 
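
The count/valid encoding described above can be sketched in a few lines of Python. This is an illustrative model only: the class and method names (RegState, ReferenceVector, allocate, integrate, unmap) are hypothetical, and the executed flag stands in for whatever the pipeline uses to tell whether a squashed instruction has already produced its result.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegState:
    count: int = 0        # number of active mappings to this register
    valid: bool = False   # True if the register is integration-eligible


class ReferenceVector:
    def __init__(self, num_regs: int) -> None:
        self.regs: List[RegState] = [RegState() for _ in range(num_regs)]

    def allocate(self, p: int) -> None:
        """Give a zero-reference register to a new instruction (0/x -> 1/T)."""
        assert self.regs[p].count == 0, "only unmapped registers can be reclaimed"
        self.regs[p] = RegState(count=1, valid=True)

    def integrate(self, p: int) -> None:
        """An integrating instruction adds one more mapping to p."""
        assert self.regs[p].valid, "0/F registers are not integration-eligible"
        self.regs[p].count += 1

    def unmap(self, p: int, executed: bool = True) -> None:
        """A squash or overwrite removes one mapping.

        If the last mapping disappears and the producer never executed,
        the register falls to 0/F so that it cannot be integrated.
        """
        r = self.regs[p]
        assert r.count > 0
        r.count -= 1
        if r.count == 0 and not executed:
            r.valid = False

    def state(self, p: int) -> str:
        r = self.regs[p]
        return f"{r.count}/{'T' if r.valid else 'F'}"
```

Retirement does not appear as an operation on the output register because, as stated above, it leaves that register's count unchanged; only the shadowed register is unmapped.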



Working example. General reuse allows combinations of 
active and retired instructions to share registers. There are 
also multiple scenarios in which sharing is partially or 
wholly dissolved. Our reference counting and resource 
management scheme handles these cases naturally. For 
intuition, we show a few common scenarios in an exam- 
ple. Figure 2 shows the processing of eight dynamic 
instructions at three relevant pipeline events: rename, com- 
mit, and squash. From left to right, the figure shows the 
event, the instruction's dynamic instance number (#1 to 
#8), its PC, its raw and renamed forms, and the post-event 
states of the IT, map table, and reference vector. Map table 
and reference vector rows are "snapshots." A given IT row 
shows the entry relevant to the particular operation. 

Our example uses three logical registers, R1-R3, and 
six (physical) registers, p1-p6. R1-R3 are initially mapped
to p1-p3, each of which is in the 1/T state; p4-p6 are free
and are in the 0/F state. The first six events show the
renaming and retirement of instructions #1-#3. Since
these do not match any IT entries, three new registers, p4- 
p6, are allocated to them. In the reference vector, these reg- 
isters transition from 0/F to 1/T; map table and reference 
vector transitions are shown in bold. Notice, when an 
instruction retires, its own output register does not change 
state. However, the reference count of the shadowed regis- 
ter (the one previously mapped to the output logical regis- 
ter) is decremented. For instance, in event #3, instruction 
#1's output register, p4, is unchanged while p2, the register
previously mapped to R2, transitions to 0/T. Recall, the 0/T 
state implies that the register contains a valid, integration- 
eligible value, but is not currently in use. 

Events #7 and #8 are integrations. Instructions #4 and 
#5 are new instances of x10 and x14 and integrate the
results of instructions #1 and #2 — p4 and p5 — respectively. 
Integrations trigger reference increments. The integration 
scenarios for instructions #4 and #5 are slightly different. 
Instruction #4 integrates a register which has been shad- 
owed by the retirement of instruction #3; its count transi- 
tions from 0/T to 1/T. Instruction #5 integrates a register 
whose mapping has been committed but not overwritten; 
its reference transition is from 1/T to 2/T. This is an 
instance of simultaneous sharing: p5 is shared by the 
retired mapping of instruction #2 and the active mapping 
of instruction #5. 

Instruction #6 (event #9) cannot integrate an existing 
result. p2, a 0/T register, is reclaimed and allocated to it. 

In event #11, instruction #5 and all subsequent instruc- 
tions (here, #6) are squashed. In a conventional processor, 
a squash restores the map table and free list to their state 
immediately prior to the renaming of the oldest squashed 
instruction (#5 here). In a processor with integration, this 
recovery procedure is applied to the map table and refer- 
ence vector. In the example, these are restored to their pre- 
event #8 state. To accommodate squash reuse, the restora- 
tion function is not an exact copy. Special logic is applied 
to entries of registers which are completely unmapped by 
the squash. This logic transitions the register to the 0/T 
state if the corresponding instruction has executed, or to 
the 0/F state if it has not. As noted above, this is done to
prevent registers of un-executed squashed instructions
from being integrated and causing deadlock. In our example,
p2 transitions to the 0/F state. Notice, p5 (2/T to 1/T) is
not completely unmapped by the squash; the squash does
not destroy p5's mapping from retired instruction #2.

FIGURE 2. General reuse reference counting example

Events #12 and #13 are integrations of registers p4 and 
p5 by instances of instructions x10 and x14, respectively.
These are cases of simultaneous sharing — each reuse reg- 
ister has at least one active mapping at the time it is 
reused. As shown, our mechanism handles general reuse in 
the presence of shadowing and mis-speculation. 

Issue: speculative reference counting. While integration 
is a performance optimization and precise IT management 
is unnecessary, the reference vector is the central tracking 
mechanism for all registers. Its state must be kept pre- 
cisely lest registers "leak." The solution, which we alluded 
to in our example, parallels the handling of the free list in a 
conventional processor. The output register numbers con- 
tained in the ROB are used to undo reference increments 
serially on a mis-speculation. For faster recovery to select 
dynamic points (e.g., after conditional branches), the ref- 
erence vector is checkpointed and restored monolithically. 
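
The two recovery paths can be contrasted in a short Python sketch. It is a simplified model under stated assumptions: the function names are hypothetical, the ROB is reduced to a list of output register numbers, and the valid-bit handling for squashed registers is left out.

```python
from typing import List


def rename_increment(counts: List[int], rob: List[int], out_preg: int) -> None:
    """Allocation or integration: bump the count and log the register in the ROB."""
    counts[out_preg] += 1
    rob.append(out_preg)


def squash_serial(counts: List[int], rob: List[int], first_squashed: int) -> None:
    """Undo increments for ROB entries at index >= first_squashed, youngest first."""
    while len(rob) > first_squashed:
        counts[rob.pop()] -= 1


def take_checkpoint(counts: List[int]) -> List[int]:
    """Snapshot the whole reference vector, e.g., at a conditional branch."""
    return list(counts)


def restore_checkpoint(counts: List[int], checkpoint: List[int]) -> None:
    """Monolithic restore: overwrite the vector in one step."""
    counts[:] = checkpoint
```

Both paths must arrive at the same counts; the serial walk works from any squash point, while the checkpoint is available only at the points where it was taken.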

Issue: IT/reference vector management. Squash reuse 
exploits an invariant one-to-one correspondence between 
integration-eligible registers and IT entries to manage the 
IT and state vector in synchrony. Joint management maxi- 
mizes integration opportunity but requires transitions in 
one structure to perform lookups in the other. We manage 
the IT and reference-vector independently. Combining 
LRU IT replacement with circular (FIFO) register recla- 
mation approximates coordinated replacement. At the 
same time, we simplify implementation and gain the flexi- 
bility to use multiple IT entries per register. This flexibility 
is important for implementing reverse integration. 

Issue: avoiding register mis-integrations. Register mis-
integrations are rare in squash reuse, where integration-eligible
entries are flushed before the right mappings can
accidentally recur. They are frequent in general reuse, 
where nearly all registers are integration-eligible and 
many persist in the IT for long periods. Unlike load mis- 
integrations, register mis-integrations are "random" and 
hence not easily predicted/avoided. 

A complete but expensive solution to register mis-inte- 
grations is to invalidate all IT entries which specify a reg- 
ister as one of the inputs whenever that register is 
reallocated. A practical approximation is to attach to each 
register a short wrap-around generation counter. This 
counter is incremented every time the register is reallo- 
cated, but is otherwise unmodified. The counters are stored 
in the map table and reference vector and are checkpointed 
and restored together with these structures. In the IT, regis- 
ter numbers are augmented with counters which are cop- 
ied from the map-table (along with the register numbers) 
when an entry is created. To simulate invalidation, we inte- 
grate only if both register numbers and counter values 
match. We have found that 4-bit counters eliminate virtu- 
ally all register mis-integrations. 
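
A minimal Python sketch of the generation-counting idea follows. The names (GenerationCounters, reallocate, tag, inputs_match) are hypothetical; only the 4-bit width comes from the text.

```python
from typing import List, Tuple

GEN_BITS = 4                      # counter width reported above
GEN_MASK = (1 << GEN_BITS) - 1


class GenerationCounters:
    def __init__(self, num_regs: int) -> None:
        self.gen: List[int] = [0] * num_regs

    def reallocate(self, p: int) -> int:
        """Increment p's wrap-around counter when p is handed to a new producer."""
        self.gen[p] = (self.gen[p] + 1) & GEN_MASK
        return self.gen[p]

    def tag(self, p: int) -> Tuple[int, int]:
        """(register, generation) pair as it would be copied into an IT entry."""
        return (p, self.gen[p])


def inputs_match(entry_inputs: Tuple[Tuple[int, int], ...],
                 current_inputs: Tuple[Tuple[int, int], ...]) -> bool:
    """Integrate only if register numbers and generations all agree."""
    return entry_inputs == current_inputs
```

Because the counter wraps, a stale entry matches again after 16 reallocations of the same register; this is why the scheme approximates invalidation rather than replacing re-execution as the correctness check.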

3.2 More Reuse via Enhanced Opcode Indexing 

PC-indexing is appropriate for squash-reuse where, by 
definition, instructions integrate the results of squashed 
instances of themselves. For general reuse, it is too restric- 
tive. To establish operational equivalence, only the opcode 
and input values (registers and immediates) are needed. 
PC matching is sufficient to establish operation and imme- 
diate value equivalence, but it is not strictly necessary. Dif- 
ferent static instructions may have identical combinations 
of opcode, immediate, and inputs (e.g., loop control 
instructions from different functions are nearly identical). 
Under PC-indexing, instances of one cannot integrate 
results generated by instances of the other. To recapture 
some of this lost opportunity, we "relax" IT indexing to 
use opcodes rather than PCs. Although this is a stand-
alone extension, its primary benefit comes from enabling
reverse integration (Section 3.3).

Opcode-indexing maximizes integration opportunity, 
but for realistic, low-associativity IT organizations it has a 
serious disadvantage. The opcode itself distributes IT 
entries poorly, inducing conflicts which reduce the integra- 
tion rate, and undermining the initial motivation for using 
opcode-indexing in the first place! Combining the opcode
and immediate to form the index relieves this problem, but 
only slightly — many dynamic instructions have opcode/ 
immediate combinations of ldq/0, addqi/1, or addq/-. 

To truly mitigate aliasing, we augment the index in a 
structured way, by mixing (XOR'ing) an additional piece 
of information with the opcode/immediate. Note, only the 
index is augmented — a minimal tag (opcode/immediate) is 
used to maximize matches within a set. To be effective, a 
piece of information must generate a sufficient number of 
distinct patterns. Furthermore, distinct patterns should 
group instructions that are likely to integrate one another's 
results, and each instruction within a group should be able 
to generate the pattern easily and independently. After 
experimenting with several indexing additions including 
logical register names and high-order PC bits, we have 
found that using the call depth — e.g., the top-of-stack 
index of the return-address-stack — yields a good distribu- 
tion and the highest integration rates. Call depth indexing 
has several nice properties. It groups instructions by func- 
tion (both statically and dynamically), exploiting the fact 
that instructions are more closely related to, and hence 
more likely to integrate results from, other instructions 
from within the same function, and in particular the same 
dynamic invocation. It is a dense numbering of small inte- 
gers that generates few conflicts outside the current func- 
tion. Finally, it meshes well with reverse integration. 
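
The split between an augmented index and a minimal tag can be sketched as follows. This is an assumption-laden illustration: the way opcode and immediate are combined, the 16-bit immediate mask, and the numeric opcode used in the test are all hypothetical; only the XOR with the call depth and the opcode/immediate tag come from the text.

```python
from typing import Tuple


def it_index(opcode: int, immediate: int, call_depth: int, num_sets: int) -> int:
    """Set index: opcode/immediate mixed with the call depth by XOR."""
    assert num_sets > 0 and num_sets & (num_sets - 1) == 0, "power-of-two sets"
    base = opcode ^ (immediate & 0xFFFF)      # hypothetical opcode/immediate mix
    return (base ^ call_depth) & (num_sets - 1)


def it_tag(opcode: int, immediate: int) -> Tuple[int, int]:
    """Tag: opcode/immediate only, so entries from any PC can match within a set."""
    return (opcode, immediate)
```

Two instructions with the same opcode and immediate at different call depths land in different sets, which spreads common combinations across the table, while instructions at the same depth still share a set and can integrate one another's results.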

3.3 Memory Bypassing via Reverse Integration 

Squash and general reuse perform direct integration: 
integration of results from older instructions. They exploit 
passive, reactive dynamic instruction repetition, and buffer 
results in the IT under a simple temporal locality assump- 
tion: the operation is likely to be executed again soon. 

Reuse has a more aggressive, active cousin: pre-execu- 
tion. In pre-execution, we use the execution of one opera- 
tion to predict a different (but closely related) operation 
that is likely to execute in the near future, execute that 
operation speculatively, and buffer its result for later 
"reuse". In this scenario, reuse is a misnomer — the reused 
operation was not previously specified by the original pro- 
gram. Pre-execution exploits a different locality assump- 
tion: the presence of a certain operation signals the arrival 
of a closely related operation. 

Register integration efficiently supports a restricted but 
powerful class of pre-execution idioms via a mechanism 
called reverse integration. In reverse integration, the 
renaming of an operation triggers the creation of an IT 
entry for the inverse operation. To create this entry, we 
simply invert the opcode/immediate combination, and 
reverse the roles of the output register and one input register.
For example, suppose we rename the instruction: addqi
p3, p1, 4. Creating the IT entry <addqi/4, p1, -, p3> allows us
to reuse future instances of addqi ?, p1, 4. With this extension,
we can also create a reverse entry <addqi/-4, p3, -, p1>
and integrate future instructions of the form addqi ?, p3, -4.
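
Constructing the reverse entry for an add-immediate is a purely syntactic transformation, as the following Python sketch shows. The ITEntry layout and the function names are hypothetical; the transformation itself (negate the immediate, swap the input and output registers) follows the example above.

```python
from typing import NamedTuple, Optional


class ITEntry(NamedTuple):
    op: str
    imm: int
    in1: Optional[int]
    in2: Optional[int]
    out: int


def direct_entry(op: str, imm: int, in1: int, out: int) -> ITEntry:
    """Entry for the instruction as renamed, e.g., <addqi/4, p1, -, p3>."""
    return ITEntry(op, imm, in1, None, out)


def reverse_entry(e: ITEntry) -> ITEntry:
    """Entry for the inverse add-immediate, e.g., <addqi/-4, p3, -, p1>."""
    if e.op != "addqi" or e.in1 is None:
        raise ValueError("this sketch only inverts add-immediate entries")
    return ITEntry(e.op, -e.imm, e.out, None, e.in1)
```

Applying reverse_entry twice returns the original entry, which reflects the symmetry of an operation-inverse pair.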

The applicability of reverse integration depends on the 
frequency of operation-inverse pairs. At first, it may 
appear that such pairs are rare; after all, why would a pro- 
gram perform the inverse operation when it had the value 
produced by this inverse to begin with? However, there is 
at least one common idiom that follows this pattern: mem- 
ory communication, the passing of values from stores to 
loads. Stores and loads are trivial inverses with respect to 
the stored value. Speculatively short-circuiting store-load 
communication — reusing the store's data input register as 
the load's data output register — is a known technique 
called speculative memory bypassing [9]. 

The basic reverse integration implementation is simple:
when renaming a store stq p1, 8(p2), we create the IT
entry for the complementary load <ldq/8, -, p2, p1>. The
structure of the reverse entry restricts the communicating 
store-load pair somewhat: the store and load must share 
the same base address register (p2 here). Fortunately, a sig- 
nificant number of store-load communications follow this 
more restricted pattern as well: saves and restores into the 
stack-frame which use the stack-pointer as their base reg- 
ister. Speculative memory bypassing for save-restore pairs 
is straightforward as long as the stack-pointer itself is not 
modified. However, we can make it work even across 
stack-pointer modifications by exploiting the observation 
that, by design, stack-pointer modifications themselves 
always come in nested operation-inverse pairs: e.g., lda sp,
-32(sp) and lda sp, 32(sp) (Alpha-speak for addqi sp, sp, -32
and addqi sp, sp, 32, respectively). When a restore operation 
takes place, the stack-pointer always has the same value as 
it did when the corresponding save executed. By using 
reverse integration on the stack-pointer itself, we create 
the situation in which this value is also in the same regis- 
ter. Notice, speculative memory bypassing via reverse 
integration meshes well with our opcode-indexing and 
entry distribution mechanisms: save-restore pairs are 
always from the same function and stack depth, as are the 
stack-pointer decrement-increment pairs. 
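
To make the interplay between store entries and stack-pointer entries concrete, the following Python sketch renames a small save/call/restore sequence. It is a toy model, not the simulated design: rename_stream and the "clobber" pseudo-operation are hypothetical, every memory operation is assumed to use sp as its base, and reference counts, generation counters, and the LISP are omitted.

```python
from itertools import count
from typing import Dict, List, Tuple


def rename_stream(insns: List[Tuple[str, str, int]],
                  map_table: Dict[str, int],
                  first_free: int) -> List[Tuple[str, int, bool]]:
    """Rename (op, reg, imm) tuples; return (op, physical register, integrated?)."""
    it: Dict[Tuple[str, int, int], int] = {}  # (op, imm, base preg) -> output preg
    free = count(first_free)
    log: List[Tuple[str, int, bool]] = []
    for op, reg, imm in insns:
        base = map_table["sp"]
        if op == "stq":
            # Stores are not integrated; they only create the reverse (load) entry.
            it[("ldq", imm, base)] = map_table[reg]
            log.append((op, map_table[reg], False))
        elif op == "clobber":
            # Stand-in for any function-body instruction that overwrites reg.
            map_table[reg] = next(free)
            log.append((op, map_table[reg], False))
        else:  # "ldq" or "lda": both write reg
            hit = it.get((op, imm, base))
            if hit is None:
                out = next(free)
                it[(op, imm, base)] = out           # direct entry
                if op == "lda":
                    it[("lda", -imm, out)] = base   # reverse entry
            map_table[reg] = out if hit is None else hit
            log.append((op, map_table[reg], hit is not None))
    return log
```

With sp, t0, and s0 initially mapped to registers 12, 20, and 22, and 31 as the first free register (values chosen here for illustration), the stream stq t0, 8(sp); lda sp, -32(sp); stq s0, 4(sp); two clobbers; ldq s0, 4(sp); lda sp, 32(sp); ldq t0, 8(sp) ends with all three logical registers back on their original physical registers, and the last three instructions integrate rather than execute.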

Working example. Figure 3 shows reverse register inte- 
gration at work, implementing speculative memory 
bypassing for both a caller-saved and a callee-saved register.
The figure shows a time series of the register renaming 
stage. From left to right are the raw (un-renamed) instruc- 
tion stream, the renamed instructions, the IT (with relevant 
reverse entries) and the state of the map table after the cur- 
rent instruction has been renamed. Execution proceeds in 
three phases. In the save sequence, the caller-saved register
t0 is saved (1), the called function opens a stack frame
by decrementing the stack pointer (3), and then saves the
callee-saved register s0 (4). For each of these three operations,
we create a reverse integration entry. For the stores
we create load entries with the instruction's data input reg- 
ister as the entry's output. For the stack-pointer decrement, 
the reverse entry contains a positive immediate and the 
input and output registers are swapped. The second phase 
is of unspecified length and contains the body of the called
function in which t0 and s0 are overwritten. The third
phase takes place around function return. The callee-restore
(5) integrates the data register of the callee-save
(p22) using the reverse entry created by that store. Integration
succeeds because the stack-pointer (p31) is not modified
between the two instructions. The stack-pointer
increment (6) integrates the reverse entry of the stack-pointer
decrement, restoring sp's pre-call mapping to p12.
This reverse integration enables the reverse integration of
the caller-restore (8).

FIGURE 3. Speculative memory bypassing via reverse integration
[Figure: dynamic instruction stream with columns #, Raw, Renamed, IT, and map table. Recoverable raw rows: 1 stq t0, 8(sp); 2 call function; 3 lda sp, -32(sp); 4 stq s0, 4(sp); ldq s0, 4(sp); 8 ldq t0, 8(sp). First renamed row: stq p20, 8(p12).]

Issue: non-standard stack disciplines. Reverse integration
captures the most frequent stack idiom: LIFO pushing
and popping of function calls. However, several idioms
(exceptions, longjmp, and alloca) manipulate the stack
pointer in non-standard ways. These do not result in incorrect
behavior, but do temporarily disrupt reverse integration
by removing a complementary increment to an
existing stack-pointer decrement from the dynamic
instruction stream. Reverse integration resumes productive
operation when new values are saved to (and subsequently
restored from) the stack.

4 Evaluation

We evaluate our extensions using cycle-level simulation.
We measure the impact of each extension on a 4-way
superscalar processor (Section 4.2), analyze integrating
instructions (4.3), measure the performance of various
integration configurations (4.4), and explore the trade-off
between integration and execution core complexity (4.5).



4.1 Environment 

We conduct our evaluation using the SPEC2000 inte- 
ger benchmarks. The benchmarks are compiled for the 
Alpha EV6 using the Digital UNIX V4 cc compiler with 
the SPEC peak optimization flags: -O3 -fast. We simulate
the training runs to completion with 10% cyclic sampling 
at a granularity of 100 million instructions per sample. Our 
simulation environment is built using the SimpleScalar 
Alpha ISA and system modules. The simulator faithfully 
models pointer-based register renaming and register inte- 
gration. Table 1 details our simulated configuration. 

4.2 Primary Performance Results 

Performance impact of our three integration exten- 
sions — general reuse, opcode-indexing, and speculative 
memory bypassing — is shown in Figure 4: the top graph 
shows speedups, the bottom one details the corresponding 
integration metrics. Each graph shows eight experiments 
grouped into four bars: squash (first bar from left) is the 
baseline squash reuse implementation [11, 13], +general 
adds general reuse, +opcode adds opcode-indexing, and 
+reverse adds speculative memory bypassing. Within a 
bar, one experiment uses a realistic LISP (bottom, light 
portion), and one uses oracle mis-integration suppression 
(top, dark portion). For integration rates, solid bars repre- 
sent direct integrations and striped bars reverse integra- 
tions. Integration rates are measured at retirement to avoid 



TABLE 1. Simulated processor configuration 

Pipeline: 
4-way superscalar, dynamically scheduled processor with a 13 stage pipeline (3 fetch, 1 decode, 1 rename, 2 schedule, 2 
register read, 1 execute, 1 writeback, 1 DIVA, 1 retire). Maximum of 128 instructions or 64 memory operations in-flight. 
8K-entry hybrid branch predictor with 4K-entry BTB. 40 reservation-station scheduler issues up to 4 instructions per 
cycle: 2 simple integer, 2 floating-point or complex-integer, 1 load, and 1 store. Loads issue speculatively with full 
squash on mis-speculation. 256-entry collision history table (CHT) stalls chronically mis-speculated loads. 

Memory System: 
32KB, 32B line, 2-way primary instruction and data caches. 2MB, 64B line, 4-way, 6-cycle L2. Infinite, 80-cycle main 
memory. 128-entry 4-way TLBs with 30 cycle hardware miss handling. 32B wide backside and memory buses clocked at 
1X and 0.25X processor frequency, respectively. Data cache access is 2 cycles and non-blocking with 16 MSHRs. Memory 
operations are preceded by 1-cycle address generation; minimal latency of a non-integrating load is 3 cycles. 

Integration: 
256 registers. 1K-entry, 4-way IT contains direct and reverse entries and is indexed by XOR of instruction's opcode, 
immediate value and call-depth. 4-bit generation counters and 1K-entry, 2-way PC-indexed LISP suppress register and 
load mis-integrations, respectively. DIVA re-executes all instructions. Mis-integrations trigger a 1-cycle pipeline flush. 
counting integrations by squashed instructions and double 
counting integrations by instructions that integrated and 
were subsequently squashed and squash reused. In the bot- 
tom graph, the number at the top of each bar is unsup- 
pressed mis-integrations (i.e., DIVA induced squashes) per 
one million retired instructions. This number corresponds 
to the realistic LISP configuration. In the top graph, the 
number under each program is its baseline IPC. 

Extension contribution. For squash reuse (squash) to 
provide benefit, the processor must control- or data- mis- 
speculate at a sufficient rate and execute a sufficient num- 
ber of instructions from the re-convergent portion of the 
mis-speculated path. With our moderate pipeline depth 
and issue width and aggressive branch and load specula- 
tion predictors, these conditions are not present. Squash 
reuse achieves a mean (arithmetic) integration rate of 2% 
and a mean (geometric) speedup of 1%. Higher integration 
rates and speedups have been measured using deeper pipe- 
lines and smaller predictors [11, 13]. As previously 
reported, mis-integrations are uncommon in squash reuse. 
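
The two averages used in this section differ in kind: integration rates are averaged arithmetically, speedups geometrically. A minimal sketch of both follows. The function names and the input values are illustrative assumptions, not measurements from our experiments.

```python
import math

def arithmetic_mean(values):
    """Plain average, as used here for integration rates (in percent)."""
    return sum(values) / len(values)

def geometric_mean_speedup(speedups_pct):
    """Geometric mean of speedups given as percentages over baseline."""
    # Convert each percentage to an execution-rate ratio (5% -> 1.05).
    ratios = [1.0 + s / 100.0 for s in speedups_pct]
    mean_log = sum(math.log(x) for x in ratios) / len(ratios)
    # Convert the mean ratio back to a percentage.
    return (math.exp(mean_log) - 1.0) * 100.0

# Made-up example inputs (not data from the paper).
rate = arithmetic_mean([1.0, 2.0, 3.0])
speedup = geometric_mean_speedup([0.0, 21.0])
```

With these made-up inputs the geometric mean (about 10%) is slightly lower than the arithmetic mean of the same two speedups (10.5%), which is why the choice of mean is stated alongside each figure.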

The addition of general reuse (+general) increases the 
average integration rate to 8% (9% with oracle mis-inte- 
gration suppression) and speedup to 2.8% (3% oracle). 
Unlike the squash integration rate, the general integration 
rate is a function of the program and the integration con- 
figuration. It is independent of the underlying microarchi- 
tecture and can produce tangible speedups even with a 
modest pipeline and accurate control speculation. Unsurprisingly, 
mis-integrations increase proportionally with 
integrations. These are almost exclusively load mis-inte- 
grations; register mis-integrations are virtually eliminated 
using our 4-bit generation counters. 

Enhanced opcode indexing (+opcode) increases the 
average integration rate to 11.5% (12% oracle) and the 
average speedup to 4.8% (5% oracle). Again, the increase 
in mis-integration rate is proportional to the increase in 
integration rate. Unlike general reuse, opcode indexing 



does not benefit all programs uniformly. Recall, opcode 
indexing produces a poorer a priori IT distribution for 
which we compensate using the call depth as an additional 
index. For this enhancement to work, a program must be 
sufficiently call-intensive and have a sufficiently deep call- 
graph (to produce multiple stack depth values). For most 
benchmarks, this strategy breaks even and produces modest 
integration rate increases of around 1%. Crafty, perl.s, 
and vortex have both the requisite call structure and multiple 
static instructions within the same function whose 
dynamic instances can successfully integrate one another's 
results. These show increases of nearly 10%. At the other 
end of the spectrum, vpr.r and (to a lesser degree) gzip have 
few integration opportunities across multiple instructions 
within the same function. For these programs, PC-indexing 
would suffice. Unfortunately, they also have few calls. 
Poor IT entry distribution dominates in these benchmarks 
and integration rates drop by about 2%. 

While opcode indexing itself does not result in signifi- 
cant gains, it does enable reverse integration (+reverse). 
Speculative memory bypassing lifts the mean integration 
rate to 15% (17% oracle) and the mean speedup to 7.3% 
(8.3% oracle). Applying reverse integration to save-restore 
pairs, we improve call-intensive benchmarks by integrating 
60% of stack-loads. Not surprisingly, the same call-poor 
programs that react adversely to opcode indexing 
(gzip and vpr.r) also do not exploit reverse integration. On 
the other hand, call-intensive programs like eon.k, gcc, 
perl, and vortex have reverse integration rates that 
approach (and often surpass) 10%. Surprisingly, while 
reverse integration increases integration rates, it actually 
reduces the average mis-integration rate. This is an artifact 
of one "outlier" program. Crafty has an unusually high 
mis-integration rate for direct integrations while its reverse 
integrations mis-integrate infrequently. Reverse entries 
displace direct entries from the IT, disproportionately cut- 
ting the mis-integration rate. 



FIGURE 4. Impact of general reuse, opcode indexing, and speculative memory bypassing 
Performance diagnostics. Integration's primary benefit is 
the streamlining of the execution stream — integrating 
instructions bypass the execution engine. By skipping half 
the pipeline, an integrating instruction's lifetime is effec- 
tively halved. In many cases, this is the dominant term in 
the integration performance equation. Integration has sec- 
ond-order performance effects as well. Integrating instruc- 
tions indirectly accelerate non-integrating instructions by 
removing themselves from scheduling contention. Integra- 
tion also expedites the resolution of mis-predicted 
branches. Mis-prediction resolution latency, measured as 
the average cycle difference between resolution (comple- 
tion) and prediction for all retired mis-predicted branches, 
is reduced from an average of 26 cycles to 23.5 cycles. 
Faster mis-prediction resolution reduces the number of 
instructions fetched along mis-speculated paths and helps 
offset some of the fetch redundancy caused by mis-inte- 
gration. Integration actually reduces the average number 
of fetched instructions slightly (an average of 0.6%). 

4.3 Integration Stream Analysis 

To better understand integration, we study the integra- 
tion retirement stream: the stream of retiring integrating 
instructions. Figure 5 shows three integration stream 
breakdowns. As usual, solid bars indicate direct integra- 
tion and striped bars indicate reverse integration. On top of 
each benchmark name, we print the integration rate. The 
data corresponds to our baseline configuration: a 1K-entry, 
4-way IT, 256 registers, and a realistic LISP. 

Integration distance. The left graph (Distance) measures 
the distance in renamed instructions between the integrat- 
ing instruction and the instruction that created the IT entry 
and result. This measure of distance indicates the number 
of cycles that pass between the creation of an IT entry and 
its use and shows the number of integrations that would be 
lost if integration were pipelined. Pipelining integration 
separates the IT read and write stages, preventing instruc- 
tions from integrating recently allocated registers. 

Fewer than 10% of integrating instructions use results 
created within the previous four instructions and fewer 
than 20% integrate results that were created within the pre- 
vious 16 instructions. In a 4-wide machine, integration 
may be pipelined over four stages with a maximum reduc- 
tion in the integration rate of 20%. Loss is capped at 20% 



because many "lost" integrations are likely to be of the 
squash reuse variety and squash reuse is impervious to 
integration pipelining. While the squashed and integrating 
instances may be separated by only ten instructions in the 
dynamic renaming stream, they are also separated by a 
pipeline flush. Intuitively, the majority of reverse integra- 
tions take place over longer instruction distances. 
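
The pipelining bound above amounts to counting integrations whose distance falls inside the window covered by the integration pipeline. A minimal sketch, assuming a hypothetical list of per-integration distances (the values below are invented for illustration):

```python
def max_fraction_lost(distances, width, stages):
    """Upper bound on the fraction of integrations lost to pipelining.

    An integration is counted as lost if its result was created within the
    previous `width * stages` renamed instructions. This is only an upper
    bound: squash-reuse integrations inside the window are not actually
    lost, because a pipeline flush separates the two instances.
    """
    window = width * stages
    lost = sum(1 for d in distances if d <= window)
    return lost / len(distances)

# Invented distances, in renamed instructions (not data from the paper).
example = [2, 10, 40, 100, 300]
bound = max_fraction_lost(example, width=4, stages=4)   # window of 16
```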

Integration distance can also be used to measure inte- 
gration locality. For that, it must be defined as the distance 
between the instruction and the most recent instruction to use the 
register (which is not necessarily the original creator). In 
the next section, we investigate locality in a different but 
equivalent way, by varying register file sizes. 

Integration-time result status. The middle graph (Status) 
shows the state of the result at the time the integrating 
instruction was renamed. We distinguish between four 
states: rename (the integrated register was allocated, but 
the corresponding operation has not been issued), issue 
(the operation has been issued), retire (the operation has 
completed and the original instruction has retired), and 
shadow/squash (the operation completed but the register 
was unmapped at the time of integration; the original 
instruction was either squashed or shadowed, i.e., retired 
and overwritten). 

This graph demonstrates two of the benefits of integra- 
tion. First, 10-20% of integrations occur before the origi- 
nal instruction has executed. These reuse opportunities 
cannot be captured by value- or name-based mechanisms 
like instruction reuse (IR) [14, 15] since the reused value 
itself is unavailable. Second, most reverse integrations 
take place after the instruction that created the stored value 
has retired (sum of the bottom two striped portions). This 
illustrates the importance of a bypassing implementation 
that can operate outside the reordering window. 
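
The four result states can be written as a small classifier. This is a sketch only: the boolean inputs and the order in which the checks are applied are assumptions made for illustration, since the definitions above do not say how overlapping cases (for example, completed but not yet retired) are binned.

```python
from enum import Enum

class Status(Enum):
    RENAME = "rename"
    ISSUE = "issue"
    RETIRE = "retire"
    SHADOW_SQUASH = "shadow/squash"

def classify(issued, completed, retired, mapped):
    """State of the integrated result when the integrating instruction is renamed."""
    if completed and not mapped:
        # Completed, but the register is unmapped: squashed or shadowed.
        return Status.SHADOW_SQUASH
    if completed and retired:
        return Status.RETIRE
    if issued:
        # Assumption: completed-but-unretired results are counted here.
        return Status.ISSUE
    # Register allocated, operation not yet issued.
    return Status.RENAME
```

Under this reading, the first benefit listed above corresponds to the RENAME class, where no value exists yet for a value-based scheme to match.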

Integration-time reference count. The right graph (Ref- 
count) tracks reference counts at the time of integration. 
This breakdown illustrates both the degree of register shar- 
ing in the program and the number of bits required for 
each reference vector entry. At the bottom of the stack are 
instructions whose integration increments the reference 
count to 1, next are those whose integrations increment the 
reference count to at most 3, and so on. These correspond 
to maximum sharing degrees enabled by 1-bit counters, 2- 



FIGURE 5. Breakdowns of integration retirement stream 
bit counters, etc. The bars corresponding to a reference 
count of 1 show integrations of squashed or shadowed 
results. Bars corresponding to reference counts greater 
than 1 show integration of instructions which are still in- 
flight or are retired but not overwritten. 

Simultaneous sharing is frequent. Nearly 60% of inte- 
grations occur while the original instruction is still active. 
However, fewer than 20% of integrated results are simultaneously 
shared by more than three instructions. While 4-bit 
counters capture virtually all sharing opportunities, 2-bit 
counters would not necessarily preclude as many as 20% of 
integrations (e.g., gzip). If an instruction attempts 
to integrate a register with a saturated reference counter, 
integration fails and the instruction allocates a new register 
and a new IT entry. Subsequent instructions will integrate 
this new register, whose reference count is 1. 
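
The saturation behavior can be sketched as follows. This is an illustrative model, not the simulated design: the names `RefCounts` and `rename`, the dictionary used as the IT, and the register numbers are hypothetical.

```python
import itertools

class RefCounts:
    """Per-register reference counters that saturate at 2**bits - 1."""
    def __init__(self, bits):
        self.max_count = (1 << bits) - 1
        self.count = {}

    def try_share(self, preg):
        """Increment preg's counter; return False if it is saturated."""
        current = self.count.get(preg, 0)
        if current >= self.max_count:
            return False
        self.count[preg] = current + 1
        return True

def rename(key, it, refs, alloc):
    """Return (physical register, integrated?) for one instruction."""
    preg = it.get(key)
    if preg is not None and refs.try_share(preg):
        return preg, True
    # No entry, or counter saturated: allocate a register and a new IT entry.
    new = alloc()
    refs.count[new] = 1
    it[key] = new
    return new, False

fresh = itertools.count(100)          # hypothetical free-register numbers
it, refs = {}, RefCounts(bits=2)      # 2-bit counters saturate at 3
key = ("ldq", 0, 31)                  # hypothetical IT key
results = [rename(key, it, refs, lambda: next(fresh)) for _ in range(5)]
```

In this run the fourth instruction finds a saturated counter and allocates p101; the fifth then integrates p101, which is why a narrow counter loses fewer integrations than the sharing histogram alone would suggest.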

4.4 Impact of Integration Configuration 

In the previous section, we measured the performance 
impact of an aggressive but (we believe) implementable 
integration configuration: 256 registers and a 1K-entry, 
4-way IT. In this section, we measure the performance of 
both more conservative (in terms of associativity and size) 
and more aggressive configurations. The former shows 
how much performance can be achieved at lower cost, the 
latter measures the performance limits of integration. 

Integration associativity. The left side of Figure 6 com- 
pares our standard 4-way configuration with 1-way, 2-way 
and fully associative ITs. The number of IT entries is fixed 
at 1K. We use 256 registers for the low-associativity 
experiments and 1K registers for the fully-associative one. 

Low associativity does not significantly degrade inte- 
gration's performance impact. While low-associativity 
reduces the number of integrations, it also reduces the 
number of mis-integrations. On the other end, full associa- 
tivity increases the number of mis-integrations. As a 
result, while most programs benefit from full associativity 



in ideal settings, only a few (e.g., perl.d) show dramatic benefits 
in realistic scenarios. Mis-integrations dampen the 
effects of associativity: performance improvement only 
drops to 6.5% and 5.3% when associativity is reduced to 
2-way and 1-way respectively, but only increases to 10% 
when full associativity is used. 

Low associativity primarily reduces direct integrations. 
Direct integrations of common opcode/immediate combi- 
nations (e.g., ldq/0, addq/-) occur at many different degrees 
of temporal locality (e.g., an integrating ldq/0 instance may 
be separated by ten ldq/0 instances from the instance 
whose register it integrates). Although it uses a limited 
number of opcodes (ldq, ldl, lda) and immediates (0, 4, 8, 
etc.), reverse integration is surprisingly insensitive to IT 
associativity. The reason is that speculative memory 
bypassing exploits a different form of locality than reuse. 
Here, there is a one-to-one correspondence between the 
instructions that create IT entries (stores and stack-pointer 
decrements) and those that read them (loads and stack- 
pointer increments). The stack-frame layout provides a 
natural indexing of entries (ldq/0, ldq/8, etc.), which elimi- 
nates IT conflicts within a function. Our call-depth 
enhancement extends conflict avoidance to span multiple 
call levels (ldq/0/1, ldq/8/1, ldq/0/2, ldq/8/2, etc.). 
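
The natural indexing described above can be illustrated with a set-index function for a 1K-entry, 4-way IT (256 sets). Table 1 states only that the index is the XOR of opcode, immediate, and call depth; the field widths, the lack of any shifting, and the numeric opcode value used below are assumptions of this sketch.

```python
NUM_ENTRIES = 1024
WAYS = 4
NUM_SETS = NUM_ENTRIES // WAYS        # 256 sets

def it_set_index(opcode, imm, call_depth, num_sets=NUM_SETS):
    """XOR the three fields and fold the result into the set range."""
    imm_bits = imm & 0xFFFF           # assumption: 16-bit immediate field
    return (opcode ^ imm_bits ^ call_depth) % num_sets

LDQ = 0x29                            # illustrative numeric opcode

# ldq/0/1, ldq/8/1, ldq/0/2, ldq/8/2
indices = [it_set_index(LDQ, imm, depth)
           for depth in (1, 2) for imm in (0, 8)]
```

The four stack loads fall into four different sets here. A plain XOR can still alias for other combinations (an immediate of 8 at depth 8 indexes like an immediate of 0 at depth 0), so an actual design would presumably align the fields to limit such collisions.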

Integration table size. The right side of Figure 6 shows 
the performance of fully-associative, LRU-managed ITs of 
four increasing sizes: 64, 256, 1K (our default), and 4K 
entries. The register file size is 4K for all experiments. 
These experiments measure a program's inherent integra- 
tion temporal locality, the dynamic instruction distances 
across which integration takes place. 

Both direct and reverse integration are temporally local 
phenomena. There are occasional high integration concen- 
trations at specific long distance values (e.g., crafty, vor- 
tex). Long-range direct integrations take place within 
large-body loops (e.g., outer loops); long range reverse 
integrations take place across large or multiple calls. 
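
A fully-associative, LRU-managed table of the kind used in these size experiments can be modeled in a few lines. The class below is a hedged sketch built on Python's `OrderedDict`; its name, keys, and register numbers are hypothetical and it ignores everything about the IT except capacity and replacement.

```python
from collections import OrderedDict

class LruIT:
    """Fully-associative table with least-recently-used replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first

    def lookup(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # a hit makes the entry most recent
        return self.entries[key]

    def insert(self, key, preg):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = preg
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recent entry

table = LruIT(capacity=2)
table.insert("a", 1)
table.insert("b", 2)
hit = table.lookup("a")      # touches "a", so "b" becomes least recent
table.insert("c", 3)         # evicts "b"
miss = table.lookup("b")
```

In such a model, an integration succeeds only if fewer distinct entries than the table capacity were touched since the producing entry was last used, which is the sense in which table size probes temporal locality.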



FIGURE 6. Impact of IT associativity and size 
FIGURE 7. Impact of integration on reduced-complexity execution engines 



Although not directly shown, even with a 4K-entry IT, 
at least 93% of the integrations in each benchmark are of 
results that were created or integrated within the previous 
256 instructions. Low associativity artificially increases 
this locality even more, by premature entry eviction. These 
factors motivate our baseline configuration choice of 256 
registers. With our 4-way IT, increasing the register count 
from 256 to 1K yields an average speedup of only 0.5%. 

4.5 Trading Integration for Execution Resources 

Integration streamlines the execution stream. We now 
investigate whether this effect enables the use of lower- 
complexity execution cores. Reduced core complexity 
could be further parlayed into increased core frequency, 
but we do not evaluate such possibilities here. Trading 
execution resources for integration resources is not a sim- 
ple case of moving complexity from one place to another. 
The out-of-order core is sensitive to both latency — depen- 
dent instructions execute serially — and complexity. Inte- 
gration is latency and complexity insensitive. Dependent 
instructions can be integrated in parallel, and integration 
can be pipelined with hazards resulting only in lost inte- 
gration opportunities [13]. Our integration distance results 
(Section 4.3) suggest that this cost is minimal in reality. 

Two main factors contribute to execution complexity: 
1) issue width influences the complexity of the scheduler 
and the bypass network, 2) number of reservation stations 
determines the complexity of scheduling and wakeup. 
Integration reduces pressure on both factors. Our sample 
integration configuration executes 15% fewer instructions 
and 27% fewer loads than a comparable integration-less 
machine (we do not count DIVA re-executions here). The 
average reservation station occupancy — the per-cycle 
number of busy slots — is reduced by 13%, from 31 to 27. 

Figure 7 shows the results of four experiments. Base 
(left bar) is our base configuration: 4-way issue with 40 
reservation stations. RS (second) is a 4-way issue configu- 
ration with 20 reservation stations. IW (third) is an asym- 
metric configuration with a 4-wide in-order section and 3- 
way issue with a single load/store issue port. IW+RS (last) 
has both reduced issue capabilities and fewer reservation 
stations. The bars show speedups relative to the base con- 
figuration without integration. Obviously, without integra- 
tion, IW, RS, and IW+RS show negative speedups. 

Reducing issue width from 4 to 3 (IW) degrades per- 
formance by an average of 12%, with load/store-intensive 



programs (e.g., eon.k, perl, vortex) hit hardest. Integration 
brings performance back to within 3% of baseline. Perfor- 
mance recovery is not uniform across all benchmarks: an 
integration rate of 15% cannot compensate for the loss of 
one load/store port in eon.k (loads and stores comprise 
45% of its dynamic instructions). Reducing the number of 
reservation stations from 40 to 20 (RS) yields an average 
performance loss of 10% (our initial choice of 40 slots sits 
just above the "knee" of the performance-sensitivity 
curve). Integration brings performance to within 2% of 
baseline. The combined effects of reduced issue width and 
buffering (IW+RS) are not additive, but neither do they 
completely overlap. While having fewer instructions in the 
reservation stations translates into fewer ready-to-execute 
instructions per cycle, the reduced execution bandwidth 
decreases the rate at which instructions exit the reservation 
stations, increasing the pressure on that resource. The per- 
formance degradation of this configuration relative to base 
is 18%. Integration is rarely able to compensate for drastic 
reductions in both resources, bringing average perfor- 
mance only to within 9% of base levels. However, note 
that our integration configuration streamlines the execu- 
tion stream by an average of 15% whereas these two sim- 
plifications combine for a 63% reduction in resources. 
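
One way to read the 63% figure, assuming (our interpretation, not stated explicitly above) that resources are taken as the product of issue width and reservation-station count:

```latex
\[
1 - \frac{3}{4}\cdot\frac{20}{40} \;=\; 1 - 0.375 \;=\; 0.625 \;\approx\; 63\%
\]
```

The factors are the issue-width reduction from 4 to 3 and the reservation-station reduction from 40 to 20.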

5 Related Work 

Dynamic instruction reuse (IR) [14, 15] implements 
general and squash reuse using a table that buffers recent 
computations. IR and direct integration are analogs. IR is 
natural for microarchitectures that use value-based renaming 
(storing non-speculative results in a register file and 
in-flight results in the ROB), like Intel's Pentium Pro. Integration 
is natural for processors that use pointer-based 
renaming (storing all results in a large uniform pool of 
physical registers), like Intel's Pentium 4 [4]. Integration leverages 
the natural advantages of the pointer-based style, 
avoiding actual data movement in favor of map table 
manipulations. The single-assignment form of this style 
also allows integration to implement dependence-tracking 
naturally. Other instruction-granularity reuse implementa- 
tions include instruction-level reuse [8], which tests for 
reuse at both rename and issue, dynamic control-indepen- 
dence (DCI) buffer [2], which uses a shadow ROB to per- 
form squash reuse, and functional unit memoization [3]. 

Unified renaming [5] uses map table manipulations to 
implement register sharing and reference counting as its 
sharing discipline. While integration uses dataflow equiva- 
lence to find sharing opportunities, unified renaming col- 
lapses identity instruction sequences like register moves 
(detected non-speculatively) and communicating store- 
load pairs (detected via a memory dependence predictor). 

The original speculative memory bypassing operation 
[9] uses address-based dependence prediction and success- 
fully connects a load-consumer with a store-producer if 
both instructions are simultaneously active and if the 
store-producer output register is still mapped when the 
load is renamed. Unified renaming [5] assimilates this 
functionality. The value address association structure 
(VAAS) [10] tags registers with reference addresses and 
implements bypassing (among other optimizations) using 
associative address matching at the data-cache access 
stage. Speculative memory cloaking [9], also called mem- 
ory renaming [16], is a sub-component of bypassing in 
which a store-load pair is transformed into a register move 
(bypassing eliminates the register move, too). The stack 
value file (SVF) [7] implements memory renaming for the 
stack. Given register integration, speculative memory 
bypassing can be implemented for free (albeit for stack 
references only) via the use of reverse entries. This formu- 
lation exploits hardwired knowledge of the save-restore 
idiom and the register dataflow of the stack pointer to 
replace memory-communication prediction and/or asso- 
ciative address matching and naturally skips the intermedi- 
ate cloaking step. No auxiliary value structures are needed 
and no values are moved; communication happens via 
redirection to existing values. Our register-dataflow based 
implementation has additional advantages in that it does 
not require the store-producer to still be in the window or 
its data register to still be mapped when the load-consumer 
is renamed and in that it can deal with arbitrary stack 
depths and connect parties in recursive callgraphs. 

6 Conclusions 

Register integration performs instruction-level result 
reuse by manipulating the register renaming table. To date, 
integration has been used to implement squash [11, 13] 
and pre-execution reuse [12]. In this paper, we broaden its 
applicability and performance impact by introducing three 
extensions. Our first extension, a register reference count- 
ing scheme that enables multiple active instructions to 
simultaneously share a single register, implements general 
reuse: reuse of results from squashed instructions, active 
in-flight instructions, retired instructions, and even instruc- 
tions whose values have been logically overwritten by 
newer retired instructions. Second, opcode-based IT 
indexing exposes more integration opportunities than the 
original PC-based organization. Finally, reverse integra- 
tion supports integration of results by operations that are 
inverses of previously executed operations— a load is inte- 
grated if the program has executed the inverse store — and 
enables more dataflow-graph compression than conven- 
tional reuse. Here, we use it to obtain a free implementa- 
tion of speculative memory bypassing for stack loads. 

Simulation results using the SPEC2000 benchmarks 



show that using a 1K-entry, 4-way IT, these extensions 
increase the integration rate, the number of retired instruc- 
tions that bypass the execution engine, to an average of 
15%. On a 4-wide processor this translates into a 7% aver- 
age speedup. Speedups of 5% and 6% can be achieved 
with simpler, direct-mapped and 2-way tables, respec- 
tively. Higher speedups can be achieved with more accu- 
rate mis-integration suppression. 

Since integration reduces execution engine load, its 
presence allows the use of lower-complexity out-of-order 
core designs. This is not a case of simply moving complexity 
from one part of the pipeline to another. The execution 
core is latency-sensitive: it must execute dependent 
chains of operations serially. Integration is latency-insensitive: 
chains of dependent operations can be integrated in 
parallel. We show that a 1K-entry, 4-way integration configuration 
can compensate for a 25% reduction in issue 
width or a 50% reduction in issue buffering. 